Relational Data Mining Techniques for Historical Document Processing

Authors

  • Michelangelo Ceci
  • Donato Malerba
Abstract

Document image understanding denotes the recognition of semantically relevant components in the layout extracted from a document image. Automatic approaches to document image understanding are in high demand among organizations involved in the preservation and valorisation of historical documents. These organizations collect more and more document images, whose effective usage critically depends on fast and accurate indexing and cataloguing. In this context, Data Mining techniques can be profitably applied to support the user in recognizing semantically relevant components in historical document images. However, such an application is not straightforward, and two important aspects have to be considered. First, extracted models should take into account the inherent spatial nature of the layout of a document image and the spatial relations among layout components of interest. Second, the low layout quality and standard of such material introduce a considerable amount of noise into its description. For these reasons, in this paper we investigate the application of a Statistical Relational Data Mining method, which allows relations between components to be represented effectively and naturally by resorting to the Relational Data Mining framework, and which guarantees robustness to noise by exploiting statistical methods. Experiments are performed on two historical document corpora from the '20s and '30s.


Related resources

Relational Learning

Most of the content-based approaches to text and web document classification explored in other related projects are based on the bag of words model, well known from the area of Information Retrieval. This model is simple and efficient, but fails to capture many additional document features such as the internal HTML structure, language structure and inter-document link structure. All this howeve...


A Digital Humanities Approach to the History of Science - Eugenics Revisited in Hidden Debates by Means of Semantic Text Mining

Comparative historical research on the intensity, diversity and fluidity of public discourses has been severely hampered by the extraordinary task of manually gathering and processing large sets of opinionated data in news media in different countries. At most 50,000 documents have been systematically studied in a single comparative historical project in the subject area of heredity and eug...


Audio Data Mining Using Multi-perceptron Artificial Neural Network

Data mining is the activity of analyzing a given set of data. It is the process of finding patterns in large relational databases. Data mining includes extracting, transforming, and loading transaction data onto the data warehouse system; storing and managing the data in a multidimensional database system; providing data; analyzing the data with application software; and visual presentation. Audio data contai...


Fast Relational Data Mining Query Optimization for Improving the Efficiency of Relational Data Mining Systems

Data mining is the process of building predictive or descriptive models based on a large data set, often stored in a relational database. Propositional data mining systems require that the data is converted into one single table. Relational data mining systems, on the other hand, can build models directly from the relational database. While building a model, relational data mining systems execu...


Apply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML

As we move to a big data world where unstructured data is growing at high velocity, there is a need for a data store to hold this large amount of data. Traditionally, relational databases have been used, but they are not suited to handling such a large amount of data, so it is necessary to move to non-relational data stores. In the current study, we have proposed an extension of the Mongo...




Publication date: 2006